Author Identification on the Large Scale

Authors

  • David Madigan
  • Alexander Genkin
  • David D. Lewis
  • Shlomo Argamon
  • Dmitriy Fradkin
  • Li Ye
Abstract

Individuals have distinctive ways of speaking and writing, and there exists a long history of linguistic and stylistic investigation into authorship attribution. In recent years, practical applications for authorship attribution have grown in areas such as intelligence (linking intercepted messages to each other and to known terrorists), criminal law (identifying writers of ransom notes and harassing letters), civil law (copyright and estate disputes), and computer security (tracking authors of computer virus source code). This activity is part of a broader growth within computer science of identification technologies, including biometrics (retinal scanning, speaker recognition, etc.), cryptographic signatures, intrusion detection systems, and others. Automating authorship attribution promises more accurate results and objective measures of reliability, both of which are critical for legal and security applications.

Recent research has used techniques from machine learning [3, 10, 13, 31, 50], multivariate and cluster analysis [24, 25, 8], and natural language processing [5, 46] in authorship attribution. These techniques have also been applied to related problems such as genre analysis [4, 1, 6, 17, 23, 46] and author profiling (such as by gender [2, 12] or personality [38]).

Our focus in this paper is on techniques for identifying authors in large collections of textual artifacts (e-mails, communiques, transcribed speech, etc.). Our approach focuses on very high-dimensional, topic-free document representations and particular attribution problems, such as: (1) Which one of these K authors wrote this particular document? (2) Did any of these K authors write this particular document?

Scientific investigation into measuring style and authorship of texts goes back to the late nineteenth century, with the pioneering studies of Mendenhall [36] and Mascol [34, 35] on distributions of sentence and word lengths in works of literature and the gospels of the New Testament. The underlying notion was that works by different authors are strongly distinguished by quantifiable features of the text. By the mid-twentieth century, this line of research had grown into what became known as “stylometrics”, and a variety of textual statistics had been proposed to quantify textual style. The style of early work was characterized by a search for invariant properties of textual statistics, such as Zipf’s distribution and Yule’s K statistic.
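As a rough illustration of the early stylometric statistics mentioned above, the following Python sketch computes a word-length distribution and Yule's K for a tokenized text. It is not the method of this paper: the function names (`tokenize`, `word_length_distribution`, `yules_k`) and the simple regular-expression tokenizer are assumptions made for the example, and Yule's K is taken in its commonly cited form, K = 10^4 (Σ i² V_i − N) / N², where N is the number of tokens and V_i is the number of distinct words occurring exactly i times.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplistic, assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def word_length_distribution(tokens):
    """Return the relative frequency of each word length, keyed by length."""
    counts = Counter(len(token) for token in tokens)
    total = sum(counts.values())
    return {length: n / total for length, n in sorted(counts.items())}


def yules_k(tokens):
    """Compute Yule's K: 10^4 * (sum_i i^2 * V_i - N) / N^2."""
    n_tokens = len(tokens)
    if n_tokens == 0:
        raise ValueError("Yule's K is undefined for an empty text")
    word_freq = Counter(tokens)              # word -> number of occurrences
    spectrum = Counter(word_freq.values())   # i -> V_i, the number of words seen i times
    second_moment = sum(i * i * v_i for i, v_i in spectrum.items())
    return 1e4 * (second_moment - n_tokens) / (n_tokens * n_tokens)


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat on the log"
    tokens = tokenize(sample)
    print(word_length_distribution(tokens))
    print(yules_k(tokens))
```

A text in which no word repeats gives K = 0, and K grows as the vocabulary becomes more repetitive, which is why it was proposed as a measure of vocabulary richness that depends little on text length.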





Publication date: 2005